Computer simulations of language change notes

This website collects my personal notes on Computer simulations of language change. These notes are provided to bring full transparency to my research process. Of course, since they are only notes, they do not reflect my final thoughts on a topic, and should not be interpreted as such. To read finished papers, please consult my website. Do not use these notes as a basis for your own scientific research. Start from high-quality, peer-reviewed scientific literature instead.

Modeling the Complexity and Descriptive Adequacy of Construction Grammars

Complexity and Descriptive Adequacy

construction grammars

operate at multiple levels of representation (lexical, syntactic, and semantic)
much more complex than purely syntactic grammars

this paper

👁 ↓

the computational complexity of CxGs
- = complexity
the descriptive adequacy against unannotated corpora using Minimum Description Length (MDL)
- descriptive adequacy

These two measures, complexity and descriptive adequacy, can be used together as an objective function for measuring the quality of CxGs: the optimum grammar balances higher descriptive adequacy against lower complexity.

The goal is not to examine the representational capacity of CxGs in general because CxG is a fundamentally usage-based paradigm (Hopper, 1987; Kay & Fillmore, 1999; Bybee, 2006). This means that the general capacity of its grammars must be weighted by their actual content: how can we model the complexity of a specific CxG used to describe a specific language, where that language is represented by a specific observable corpus?

Previous computational work on CxG (Steels, 2004; Bryant, 2004; Chang, et al., 2012; Steels, 2012) has relied on introspection-based representations that require a linguist to determine the optimum constructions by intuition. From a linguistic perspective, these representations are neither replicable nor falsifiable and are unable to test hypotheses about the mechanisms of emergence that map from observed usage to learned generalizations. From a computational perspective, these representations are not scalable across domains and languages and are subject to all the constraints of knowledge-based systems. Other data-driven approaches (Wible & Tsao, 2010; Forsberg, et al., 2014) generate potential constructions but do not evaluate the quality of competing CxGs as collections of constructions.

Representing CxGs

Types of representations

Representations recognised by the algorithm

1. lexical representations

consist of tokenised word forms (in lowercase)

2. syntactic representations

consist of part-of-speech categories
Universal POS tagset

3. semantic representations

consist of clusters of distributionally similar words
represent semantic domains, computed using GenSim’s implementation of word2vec (Rehurek & Sojka, 2010)

The embedding model is trained using 1 billion words from web-crawled corpora for each language (from the WAC corpora: Baroni, et al., 2009; and Aranea corpora: Benko, 2014) using skip-grams with 500 dimensions. These embeddings are segmented into categorical domains using k-means clustering (k = 100). The idea behind these three types of representation is that a particular slot in a construction can be defined or constrained at the lexical, syntactic, or semantic level. These representations thus form the basic alphabet of the algorithm.

What are constructions?

constructions

sequences containing a certain number of slots

(1a) [slot 1 — slot 2 — slot 3 — slot 4]

↳ each construction

surrounded by brackets and each slot within a construction is separated by a dash (“—”)

↳ each slot in a construction

represented or defined by constraints that govern which units can occupy that slot

Types of constraints

(1b) [NOUN — “gave” — (animate) — “a hand”]
(1d) [NOUN — (transfer) — (animate) — NOUN]

↳ lexical constraints

indicated using single quotes (e.g. “gave”)

↳ syntactic constraints

indicated using part-of-speech tags in uppercase (e.g., noun)

↳ semantic constraints

indicated within parentheses with the identifier for the semantic domain (e.g. animate)

↓

(1c) “Bill gave Peter a hand.”
(1e) “Bill sent Peter a package.”

The construction in (1b) describes the utterance in (1c) but not the utterance in (1e); the construction in (1d) describes the utterances in both (1c) and (1e).

↓

This provides a good example of the complexity problem: CxGs potentially have multiple overlapping representations for any given sentence. The sentence in (1e), for example, can be represented by the construction in (1d), in which slots are defined by both syntactic constraints (i.e., NOUN) and semantic constraints (i.e., animate). CxGs can distinguish between (1e) and its more idiomatic counterpart (1c) using representations such as (1d) and (1b). The question, however, is how many of these item-specific or idiomatic representations are needed in the grammar: each item-specific construction increases grammar complexity.

Writing a grammar

In this paper

construction

the grammatical description (e.g. 1b)

construct

a member of the set of utterances which that construction represents (e.g. 1c)

For a given grammar, the set of constructions is closed but the set of constructs is open. A construct or utterance can be represented by multiple constructions: representations like (1b) that are more item-specific alongside representations like (1d) that are more schematic. This leads to relationships between constructions: an inheritance hierarchy in which (1b) is a child of (1d).

Limitations

constraints are limited to a single type of representation per slot
- if a slot is constrained to the semantic domain animate, any syntactic category could be used to fill that slot
although constituents are able to fill construction slots (i.e., “a hand” can occupy a single slot as a single NOUN), larger constructions such as (1d) cannot fill slots in other constructions
no relations are learned between constructions in the grammar (i.e., the inheritance hierarchy is not modelled)

Computational representation

construction

an array of slots
each slot = a tuple with two pointers
1. pointer to the alphabet constraining that slot (i.e. lexical/syntactic units)
2. pointer to a particular unit within that alphabet (i.e. “a hand” or NOUN)

slot-fillers

can be constituents
accomplished using a context-free phrase structure grammar containing rules such as 👁 ↓

Herschrijfregels

NOUN → DETERMINER + NOUN

The CFG is learnt during a syntax-only iteration described in the next section.

The syntactic alphabet, then, also supports pointers to complex sequences through this CFG: the construction points to a NOUN and the CFG allows larger constituents to be labeled as a single NOUN. The current implementation produces a context-free CxG

Finding CxGs

Building the grammar

data source for each language

a large web-crawled corpus in that language
grammar is learned by searching across potential grammars, each of which is evaluated against the corpus until the optimum grammar is found
- measure → 👁 ↓

The search for the optimum grammar is conducting using a tabu search (Glover 1989, 1990a) with multi-unit association measures (Dunn, 2017) used to sample potential constructions.

↳ objective function

very important
how can we know that one grammar is better than another without evaluating them against gold-standard annotations?

Levels

Three levels of CxGs are learnt

1. CxG_LEX

operates only on lexical representations
identifies purely lexical constructions: sequences of lexical items that have been fused together so that their internal structure can be ignored
e.g. could be and will be

2. CxG_SYN

operates only on syntactic representations
later used as phrase structure rules
e.g. [VERB - NOUN] is identified as a purely syntactic construction

3. CxG_FULL

operates on all levels of representation
constructions output from a previous pass become atomic units in the current pass

Measuring grammar quality

“grammar”

a method for encoding observed linguistic utterances
the learner is searching for the smallest adequate encoding method
⇒ can the grammar produce the utterances observed in held-out test-sets?

“optimal grammar”

balances model complexity (the number and type of constructions in the grammar) and the amount of compression achieved when the model is used to encode a test corpus

M D L=\min _G\left\{L_1(G)+L_2(D \mid G)\right\}

the optimum grammar is the grammar which minimises model complexity (represented by the encoding size of the grammar) and the size of the dataset encoded by means of the grammar

Encoding size in MDL

<span class="todo rendered">TODO</span>

Comparing grammar models

<span class="todo rendered">TODO</span>

to what degree is $G_A$ better than $G_B$ ?

Relevance of Minimum Description Length

why?

does not rely on gold-standard annotations

↓

we do not necessarily have gold-standard annotations for every language and language variety we are interested in
simply relying on gold-standard annotations ignores the question we are most interested in: how do we know empirically that one grammar is better than another?

Measuring the encoding size of CxGs

???

Results and discussion

Computer simulations of language change notes